{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Online Retail II Data Set\n",
    "\n",
    "In this notebook we will prepare and store the Online Retail II Data Set stored on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II)\n",
    "\n",
    "\n",
    "**Citation:**\n",
    "\n",
    "Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.\n",
    "\n",
    "## Download the data\n",
    "\n",
    "- Navigate to the [data folder](https://archive.ics.uci.edu/dataset/502/online+retail+ii).\n",
    "- Download the file called **online_retail_II.xlsx**.\n",
    "- Save the Excel file into the **datasets** folder at the root of this repository."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# If you downloaded and stored the file as explained\n",
    "# above, it should be located here:\n",
    "\n",
    "file = \"../Datasets/online_retail_II.xlsx\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The data is provided as two sheets in a single Excel file.\n",
    "# Each sheet contains a different time period.\n",
    "# Load both and join into a single dataframe.\n",
    "\n",
    "df_1 = pd.read_excel(file, sheet_name=\"Year 2009-2010\")\n",
    "df_2 = pd.read_excel(file, sheet_name=\"Year 2010-2011\")\n",
    "\n",
    "df = pd.concat([df_1, df_2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Invoice</th>\n",
       "      <th>StockCode</th>\n",
       "      <th>Description</th>\n",
       "      <th>Quantity</th>\n",
       "      <th>InvoiceDate</th>\n",
       "      <th>Price</th>\n",
       "      <th>Customer ID</th>\n",
       "      <th>Country</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>489434</td>\n",
       "      <td>85048</td>\n",
       "      <td>15CM CHRISTMAS GLASS BALL 20 LIGHTS</td>\n",
       "      <td>12</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>6.95</td>\n",
       "      <td>13085.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>489434</td>\n",
       "      <td>79323P</td>\n",
       "      <td>PINK CHERRY LIGHTS</td>\n",
       "      <td>12</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>6.75</td>\n",
       "      <td>13085.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>489434</td>\n",
       "      <td>79323W</td>\n",
       "      <td>WHITE CHERRY LIGHTS</td>\n",
       "      <td>12</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>6.75</td>\n",
       "      <td>13085.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>489434</td>\n",
       "      <td>22041</td>\n",
       "      <td>RECORD FRAME 7\" SINGLE SIZE</td>\n",
       "      <td>48</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>2.10</td>\n",
       "      <td>13085.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>489434</td>\n",
       "      <td>21232</td>\n",
       "      <td>STRAWBERRY CERAMIC TRINKET BOX</td>\n",
       "      <td>24</td>\n",
       "      <td>2009-12-01 07:45:00</td>\n",
       "      <td>1.25</td>\n",
       "      <td>13085.0</td>\n",
       "      <td>United Kingdom</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Invoice StockCode                          Description  Quantity  \\\n",
       "0  489434     85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12   \n",
       "1  489434    79323P                   PINK CHERRY LIGHTS        12   \n",
       "2  489434    79323W                  WHITE CHERRY LIGHTS        12   \n",
       "3  489434     22041         RECORD FRAME 7\" SINGLE SIZE         48   \n",
       "4  489434     21232       STRAWBERRY CERAMIC TRINKET BOX        24   \n",
       "\n",
       "          InvoiceDate  Price  Customer ID         Country  \n",
       "0 2009-12-01 07:45:00   6.95      13085.0  United Kingdom  \n",
       "1 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  \n",
       "2 2009-12-01 07:45:00   6.75      13085.0  United Kingdom  \n",
       "3 2009-12-01 07:45:00   2.10      13085.0  United Kingdom  \n",
       "4 2009-12-01 07:45:00   1.25      13085.0  United Kingdom  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Inspect dataframe\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rename columns\n",
    "\n",
    "df.columns = [\n",
    "    \"invoice\",\n",
    "    \"stock_code\",\n",
    "    \"description\",\n",
    "    \"quantity\",\n",
    "    \"invoice_date\",\n",
    "    \"price\",\n",
    "    \"customer_id\",\n",
    "    \"country\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Process data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remove null customer ids."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "mask = ~df[\"customer_id\"].isnull()\n",
    "df = df[mask]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a flag for when an order is cancelled. Cancelled orders contain \n",
    "the letter `C` at the start of the invoice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"is_cancelled\"] = df[\"invoice\"].apply(lambda x: str(x)[0] == \"C\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remove transactions which are negative quantities sold and are not cancelled orders."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "mask = ~(~df[\"is_cancelled\"] & df[\"quantity\"] < 0)\n",
    "\n",
    "df = df[mask]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute revenue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"revenue\"] = df[\"quantity\"] * df[\"price\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To compute gross revenue and quantity sold we filter out cancelled orders.\n",
    "\n",
    "After this, we resample the data at a weekly level."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "mask = ~df[\"is_cancelled\"]\n",
    "\n",
    "# If running this raises an UnsupportedFunctionCall error\n",
    "# try upgrading your version of pandas.\n",
    "df_gross = (\n",
    "    df.loc[mask, [\"invoice_date\", \"quantity\", \"revenue\", \"country\"]]\n",
    "    .groupby(\"country\")\n",
    "    .resample(\"W\", on=\"invoice_date\")\n",
    "    .sum(numeric_only=True)\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_gross.index.rename([\"country\", \"week\"], inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Save data\n",
    "\n",
    "We will save 3 different versions of the preprocessed dataset for different demos.\n",
    "\n",
    "## Weekly sampled"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_gross_countries = df_gross.reset_index(level=\"country\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "countries = [\n",
    "    'United Kingdom',\n",
    "    'Belgium',\n",
    "    \"EIRE\",\n",
    "    'Germany',\n",
    "    \"France\",\n",
    "    'Spain',\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_gross_countries[df_gross_countries[\"country\"].isin(countries)].to_csv(\n",
    "    \"../Datasets/online_retail_dataset_countries.csv\",\n",
    "    index=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unstacked countries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = df_gross.unstack(\"country\")[\"revenue\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>country</th>\n",
       "      <th>Australia</th>\n",
       "      <th>Austria</th>\n",
       "      <th>Bahrain</th>\n",
       "      <th>Belgium</th>\n",
       "      <th>Brazil</th>\n",
       "      <th>Canada</th>\n",
       "      <th>Channel Islands</th>\n",
       "      <th>Cyprus</th>\n",
       "      <th>Czech Republic</th>\n",
       "      <th>Denmark</th>\n",
       "      <th>...</th>\n",
       "      <th>Singapore</th>\n",
       "      <th>Spain</th>\n",
       "      <th>Sweden</th>\n",
       "      <th>Switzerland</th>\n",
       "      <th>Thailand</th>\n",
       "      <th>USA</th>\n",
       "      <th>United Arab Emirates</th>\n",
       "      <th>United Kingdom</th>\n",
       "      <th>Unspecified</th>\n",
       "      <th>West Indies</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>week</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2009-12-06</th>\n",
       "      <td>196.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>439.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>989.18</td>\n",
       "      <td>760.69</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1008.00</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>435.88</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>141.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>213000.35</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009-12-13</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1429.83</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8.5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.00</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>412.60</td>\n",
       "      <td>285.3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>517.7</td>\n",
       "      <td>195810.04</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009-12-20</th>\n",
       "      <td>75.0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.00</td>\n",
       "      <td>2796.29</td>\n",
       "      <td>NaN</td>\n",
       "      <td>429.66</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1952.64</td>\n",
       "      <td>0.0</td>\n",
       "      <td>589.4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>182396.74</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2009-12-27</th>\n",
       "      <td>0.0</td>\n",
       "      <td>568.51</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.00</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5149.06</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>22007.77</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2010-01-03</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.00</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 41 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "country     Australia  Austria  Bahrain  Belgium  Brazil  Canada  \\\n",
       "week                                                               \n",
       "2009-12-06      196.1      NaN      NaN    439.1     NaN     NaN   \n",
       "2009-12-13        0.0  1429.83      NaN      8.5     NaN     NaN   \n",
       "2009-12-20       75.0     0.00      NaN      0.0     NaN     NaN   \n",
       "2009-12-27        0.0   568.51      NaN      0.0     NaN     NaN   \n",
       "2010-01-03        0.0     0.00      NaN      0.0     NaN     NaN   \n",
       "\n",
       "country     Channel Islands   Cyprus  Czech Republic  Denmark  ...  Singapore  \\\n",
       "week                                                           ...              \n",
       "2009-12-06           989.18   760.69             NaN  1008.00  ...        NaN   \n",
       "2009-12-13             0.00     0.00             NaN     0.00  ...        NaN   \n",
       "2009-12-20             0.00  2796.29             NaN   429.66  ...        NaN   \n",
       "2009-12-27             0.00     0.00             NaN     0.00  ...        NaN   \n",
       "2010-01-03             0.00     0.00             NaN     0.00  ...        NaN   \n",
       "\n",
       "country       Spain  Sweden  Switzerland  Thailand    USA  \\\n",
       "week                                                        \n",
       "2009-12-06   435.88     NaN          NaN       NaN  141.0   \n",
       "2009-12-13   412.60   285.3          NaN       NaN    0.0   \n",
       "2009-12-20  1952.64     0.0        589.4       NaN    0.0   \n",
       "2009-12-27  5149.06     0.0          0.0       NaN    0.0   \n",
       "2010-01-03     0.00     0.0          0.0       NaN    0.0   \n",
       "\n",
       "country     United Arab Emirates  United Kingdom  Unspecified  West Indies  \n",
       "week                                                                        \n",
       "2009-12-06                   NaN       213000.35          NaN          NaN  \n",
       "2009-12-13                 517.7       195810.04          NaN          NaN  \n",
       "2009-12-20                   0.0       182396.74          NaN          NaN  \n",
       "2009-12-27                   0.0        22007.77          NaN          NaN  \n",
       "2010-01-03                   0.0            0.00          NaN          NaN  \n",
       "\n",
       "[5 rows x 41 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "y.to_csv(\"../Datasets/online_retail_dataset.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Raw data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# columns needed for demo\n",
    "cols = [\"invoice_date\", \"description\", \"revenue\"]\n",
    "\n",
    "# just UK\n",
    "df = df[df[\"country\"] == \"United Kingdom\"]\n",
    "\n",
    "# save\n",
    "df[cols].to_csv(\"../Datasets/online_retail_dataset_all.csv\", index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
